ABSTRACT



A modular system and method is provided for low bit rate 
encoding and decoding of speech signals using voicing 
probability determination. The continuous input speech is 
divided into time segments of a predetermined length. For 
each segment the encoder of the system computes a model 
signal and subtracts the model signal from the original signal 
in the segment to obtain a residual excitation signal. Using 
the excitation signal the system computes the signal pitch 
and a parameter which is related to the relative content of 
voiced and unvoiced portions in the spectrum of the exci- 
tation signal, which is expressed as a ratio Pv, defined as a 
voicing probability. The voiced and the unvoiced portions of
the excitation spectrum, as determined by the parameter Pv, 
are encoded using one or more parameters related to the 
energy of the excitation signal in a predetermined set of 
frequency bands. In the decoder, speech is synthesized from 
the transmitted parameters representing the model speech, 
the signal pitch, voicing probability and excitation levels in 
a reverse order. Boundary conditions between voiced and 
unvoiced segments are established to ensure amplitude and 
phase continuity for improved output speech quality. Per- 
ceptually smooth transition between frames is ensured by 
using an overlap and add method of synthesis. LPC interpolation
and post-filtering are used to obtain output speech
with improved perceptual quality.

32 Claims, 10 Drawing Sheets 
LOW BIT-RATE SPEECH CODING SYSTEM
AND METHOD USING VOICING
PROBABILITY DETERMINATION

This application is a continuation of application Ser. No.
08/528,513, filed Sep. 13, 1995, now U.S. Pat. No. 5,774,832,
and claims the benefit of U.S. Provisional application
Ser. No. 60/004,709, filed Oct. 3, 1995.

BACKGROUND OF THE INVENTION

The present invention relates to speech processing and
more specifically to a method and system for low bit rate
digital encoding and decoding of speech using separate
processing of voiced and unvoiced components of speech
signal segments on the basis of a voicing probability
determination.

Digital encoding of voiceband speech has been subject to
intensive research for at least three decades now, as a result
of which various techniques have been developed targeting
different speech processing applications at bit rates ranging
from about 64 kb/s to about 2.4 kb/s. Two of the main factors
which influence the choice of a particular speech processing
algorithm are the desired speech quality and the bit rate.
Generally, the lower the bit rate of the speech coder, i.e.
higher signal compression, the more the speech quality
suffers to some extent. In each specific application, it is thus
a matter of compromise between the desired speech quality,
which in many instances is strictly specified, and the
information capacity of the transmission channel and/or the
speech processing system which determine the bit rate. The
present invention is specifically directed to a low bit rate
system and method for speech and voiceband coding to be
used in speech processing and modern multimedia systems
which require large volumes of data to be processed and
stored, often in real time, and acceptable quality speech to be
delivered over narrowband communication channels.

For practical low bit rate digital speech signal
transformation, communication and storage purposes it is
necessary to reduce the amounts of data to be transmitted
and stored by eliminating redundant information without
significant degradation of the output speech quality. There
are some well known prior art speech signal compression
and coding techniques which exploit signal redundancies to
reduce the required bit rate. Generally, these techniques can
be classified as speech processing using analysis-and-
synthesis (AAS) and analysis-by-synthesis (ABS) methods.
Although AAS methods, such as residual excited linear
predictive coding (RELP), adaptive predictive coding (APC)
and subband coding (SBC) have been successful at rates in
the range of about 9.6-16 kb/s, below that range they can no
longer produce good quality speech. The reasons for that are
generally related to the fact that: (a) there is no feedback
mechanism to control the distortions in the reconstructed
speech; and (b) errors in one speech frame generally
propagate in subsequent frames without correction. In ABS
schemes, on the other hand, both these factors are taken into
account, which enables them to operate much more
successfully in the low bit rate range.

Specifically, in ABS coding systems it is assumed that the
signal can be observed and represented in some form. Then,
a theoretical signal production model is assumed which has
a number of adjustable parameters to model different ranges
of the input signal. By varying parameters of the model in
a systematic way it is thus possible to find a set of parameters
that can produce a synthetic speech signal which matches
the real signal with minimum error. In practical applications
synthetic speech is most often generated as the output of a
linear predictive coding (LPC) filter. Next, a residual,
"excitation" signal is obtained by subtracting the synthetic
model speech signal from the actual input signal. Generally,
the dynamic range of the residual signal is much more limited,
so that fewer bits are required for its transmission and
storage. Finally, perceptually based minimization procedures
can be employed to reduce the speech distortions at the
synthesis end even further.

Various techniques have been used in the past to design
the speech model filter, to form an appropriate excitation
signal and minimize the error between the original signal
and the synthesized output in some meaningful way. There
appears to be a consensus, however, that no single technique
is likely to succeed in all applications. The reason for this is
that the performance of digital compression and coding
systems for voice signals is highly dependent on the speaker
[...] application thus frequently [...] underlying signal model
and the flexibility in adjusting the model parameters. As
known in the art, [...] speech signal models have been
proposed [...].

Most frequently, speech is modeled on a short-time basis
as the response of a linear system excited by a periodic
pulse train for voiced sounds or random noise for the
unvoiced sounds. For mathematical convenience, it is
assumed that the speech signal is stationary within a given
short time segment, so that the continuous speech is
represented as an ordered sequence of distinct voiced and
unvoiced speech segments.

Voiced speech segments, which correspond to vowels in
a speech signal, typically contribute most to the
intelligibility of the speech, which is why it is important to
accurately represent these segments. However, for a
low-pitched voice, a set of more than 80 harmonic frequencies
("harmonics") may be measured within a voiced speech segment
within a 4 kHz bandwidth. Clearly, encoding information about
all harmonics of such a segment is only possible if a large
number of bits is used. Therefore, in applications where it is
important to keep the bit rate low, more sophisticated speech
models need to be employed.

One typical approach is to separate the speech signal into
its voiced and unvoiced components. The two components
are then synthesized separately and finally combined to
produce the complete speech signal. For example, U.S. Pat.
No. 4,771,465 describes a speech analyzer and synthesizer
system using a sinusoidal encoding and decoding technique
for voiced speech segments and noise excitation or
multipulse excitation for unvoiced speech segments. In the
process of encoding the voiced segments a fundamental subset
of harmonic frequencies is determined by a speech analyzer
and is used to derive the parameters of the remaining
harmonic frequencies. The harmonic amplitudes are
determined from linear predictive coding (LPC) coefficients.
The method of synthesizing the harmonic spectral amplitudes
from a set of LPC coefficients, however, requires extensive
computations and yields relatively poor quality speech.

Different techniques focus on more accurate modeling of
the excitation signal. The excitation signal in a speech
coding system is very important because it reflects residual
information which is not covered by the theoretical model of
the signal. This includes the pitch, long term and random
patterns, and other factors which are critical for the
intelligibility of the reconstructed speech. One of the most
important parameters in this respect is the accurate
determination of the pitch. Studies have shown that the human
ear is more sensitive to changes in the pitch compared to
changes in other speech signal parameters by an order of
magnitude, which is why a number of techniques to accurately
estimate the pitch have been proposed in the past. For
example, U.S. Pat. Nos. 5,226,108 and 5,216,747 to Hardwick
et al. describe an improved pitch estimation method providing
sub-integer resolution. The quality of the output speech
according to the proposed method is improved by increasing
the accuracy of the decision as to whether a given speech
segment is voiced or unvoiced. This decision is made by
comparing the energy of the current speech segment to the
energy of the preceding segments. The proposed methods,
however, generally do not allow accurate estimation of the
amplitude information for all harmonics.

In an approach related to the harmonic signal coding
techniques discussed above, it has been proposed to increase
the accuracy of the signal reconstruction by using a series of
binary voiced/unvoiced decisions corresponding to each
speech frame in what is known in the art as multiband
excitation (MBE) coders. The MBE speech coders provide
more flexibility in the selection of speech voicing compared
with traditional vocoders, and can be used to generate good
quality speech. In fact, an improved version of the MBE
(IMBE) vocoder operating at 4.15 kb/s, with forward error
correction (FEC) making it up to 6.4 kb/s, has been chosen
for use in INMARSAT-M. In these speech coders, however,
typically the number of harmonic magnitudes in the 4 kHz
bandwidth varies with the fundamental frequency, requiring
variable bit allocation for each harmonic magnitude from
one frame to another, which can result in variable speech
quality for different speakers. Another limitation of the
IMBE coder is that the bit allocation for the model
parameters depends on the fundamental frequency, which reduces
the robustness of the system to channel errors. In addition,
errors in the voiced/unvoiced decisions, especially when
made in the low frequency bands, result in perceptually
objectionable degradation in the quality of the output
speech.

Therefore, it is perceived that there exists a need for more
flexible methods for encoding and decoding of speech,
which can be used in low bit rate applications. Accordingly,
there is a present need to develop a modular system in which
optimized processing of different speech segments, or
speech spectrum bands, is performed in specialized
processing blocks to achieve best results for different types of
speech and other acoustic signal processing applications.
Furthermore, there is a need to more accurately classify each
speech segment in terms of its voiced/unvoiced content in
order to apply optimum signal compression for each type of
signal. In addition, there is a need to obtain accurate
estimates of the amplitudes of the spectral harmonics in voiced
speech segments in a computationally efficient way and to
develop a method and system to synthesize such voiced
speech segments without the requirement to store or transmit
separate phase information.

SUMMARY OF THE INVENTION 

Accordingly, it is an object of the present invention to
provide a modular system and method for encoding and
decoding of speech signals at low to very low bit rates on the
basis of a voicing probability determination.

It is another object of the present invention to provide a
novel encoder in which, following an analysis-by-synthesis
spectrum modeling, the voiced and the unvoiced portions of
the excitation signal, as determined by the voicing
probability of the frame, are processed separately for optimal coding.
It is yet another object of the present invention to provide
a speech synthesizer which, on the basis of the voicing 
probability of the signal in each frame, synthesizes the 
voiced and the unvoiced portions of the excitation signal 
separately and combines them into a composite recon- 
structed excitation signal for the frame; the reconstructed 
excitation signal is then combined with the signal in adjacent 
speech segments with minimized amplitude and phase dis- 
tortions and passed through a model filter to obtain output 
speech of good perceptual quality. 

These and other objectives are achieved in accordance
with the present invention by means of a novel modular
encoder/decoder speech processing system in which the
input speech signal is represented as a sequence of frames
(time segments) of predetermined length. The spectrum
$S(\omega)$ of each such frame is modeled as the output of a
linear time-varying filter which receives as input an
excitation signal with certain characteristics. Specifically, the
time-varying filter is assumed to be an all-pole filter,
preferably an LPC filter with a pre-specified number of
coefficients which can be obtained using the standard
Levinson-Durbin algorithm. Next, a synthetic speech signal
spectrum is constructed using LPC inverse filtering based on
the computed LPC model filter coefficients. The synthetic
spectrum is removed from the original signal spectrum to
result in a generally flat excitation spectrum, which is then
analyzed to obtain the remaining parameters required for the
low bit rate encoding of the speech signal. For optimal storage
and transmission the LPC coefficients are replaced with a set
of corresponding line spectral frequencies (LSF) coefficients
which have been determined for practical purposes to be less
sensitive to quantization, and also lend themselves to
intra-frame interpolation. The latter feature can be used to
further reduce the bit rate of the system.
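
The Levinson-Durbin step mentioned above can be illustrated with a minimal sketch. It assumes the autocorrelation values of a frame are already available and uses the sign convention in which the inverse filter is 1 + a_1 z^-1 + ... + a_p z^-p; the function name `levinson_durbin` and its return values are illustrative choices, not taken from the patent.

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve sum_k a[k] * r[|i-k|] = -r[i] for i = 1..p (hypothetical helper).

    r : autocorrelation values r[0..p]
    Returns (a, err): a[0] == 1, a[1..p] are the predictor coefficients,
    err is the final prediction error r[0] + sum_k a[k] * r[k].
    """
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = float(r[0])
    for m in range(1, p + 1):
        # Reflection coefficient for model order m.
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err
        # Order update of the coefficient vector.
        a_prev = a.copy()
        for i in range(1, m):
            a[i] = a_prev[i] + k * a_prev[m - i]
        a[m] = k
        err *= (1.0 - k * k)
    return a, err
```

The recursion needs on the order of p^2 operations instead of the p^3 of a general linear solver, which is why it is the usual choice for a 10-coefficient LPC model.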

In accordance with a preferred embodiment of the present 
invention the excitation spectrum is completely specified by 
several parameters, including the pitch (the fundamental 
frequency of the segment), a voicing probability parameter 
which is defined as the ratio between the voiced and the 
unvoiced portions of the spectrum, and one or more param- 
eters related to the excitation energy in different parts of the 
signal spectrum. In a specific embodiment of the present 
invention directed to a very low bit rate system, a single 
parameter indicating the total energy of the signal in a given 
frame is used. 

In particular, the system of the present invention deter- 
mines the pitch and the voicing probability Pv for the 
segment using a specialized pitch detection algorithm. 
Specifically, after determining a value for the pitch, the 
excitation spectrum of the signal is divided into a number of 
frequency bins corresponding to frequencies harmonically 
related to the pitch. If the normalized energy in a bin, i.e., the
error between the original spectrum of the speech signal in
the frame and the synthetic spectrum generated from the
LPC inverse filter, is less than the value of a frequency- 
dependent adaptive threshold, the bin is determined to be 
voiced; otherwise the bin is considered to be unvoiced. The 
voicing probability Pv is computed as the ratio of the 
number of voiced frequency bins over the total number of 
bins in the spectrum of the signal. In accordance with a 
preferred embodiment of the present invention it is assumed 
that the low frequency portion of the signal spectrum 
contains a predominantly voiced signal, while the high 
frequency portion of the spectrum contains predominantly 
the unvoiced portion of the speech signal, and the boundary 
between the two is determined by the voicing probability Pv. 
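
As a rough illustration of the voicing decision just described, the sketch below counts the harmonic bins whose normalized error falls below a per-bin threshold and reports the voiced fraction. The threshold values, the function names and the mapping of Pv onto a 4 kHz band are assumptions made for the example; the patent does not give them at this point.

```python
import numpy as np

def voicing_probability(bin_errors, thresholds):
    """Return (Pv, voiced_mask) for one frame (hypothetical helper).

    bin_errors : normalized error per harmonic bin
    thresholds : frequency-dependent threshold per bin (assumed values)
    """
    errors = np.asarray(bin_errors, dtype=float)
    thr = np.asarray(thresholds, dtype=float)
    voiced = errors < thr              # a bin is voiced if its error is small
    return voiced.sum() / voiced.size, voiced

def voiced_cutoff_hz(pv, bandwidth_hz=4000.0):
    """Boundary between the voiced low band and the unvoiced high band."""
    return pv * bandwidth_hz

# Five harmonic bins with a threshold that tightens toward high frequencies.
errors = [0.10, 0.20, 0.15, 0.60, 0.80]
thresholds = np.linspace(0.4, 0.2, 5)
pv, mask = voicing_probability(errors, thresholds)
cutoff = voiced_cutoff_hz(pv)
```

With these made-up numbers three of the five bins are voiced, so the voiced band would extend over the lowest three fifths of the spectrum.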

Once the voicing probability Pv is determined, the speech 
segment is separated into a voiced portion, which is assumed 
to cover a Pv portion in the low-end of the spectrum, and an
unvoiced portion occupying the remainder of the spectrum.
In a specific embodiment of the present invention directed to
a very low bit rate system, a single parameter indicating the
total energy of the signal in a given frame is transmitted. In
an alternative embodiment, the spectrum of the signal is
divided into two or more bands, and the average energy for
each band is computed from the harmonic amplitudes of the
signal that fall within each band. Advantageously, due to the
different perceptual importance of different portions of the
spectrum, frequency bands in the low end of the spectrum
(its voiced portion) can be linearly spaced, while frequency
bands in the high end of the spectrum can be spaced
logarithmically for higher coding efficiency. The computed
band energies are then quantized for transmission. A
parameter encoder finally generates for each frame of the speech
signal a data packet, the elements of which contain
information necessary to restore the original speech segment. In
a preferred embodiment of the present invention, a data
packet comprises: control information, the LSF coefficients
for the model LPC filter, the voicing probability Pv, the
pitch, and the excitation power in each spectrum band.
Instead of transmitting the actual parameter values for each
frame, in an alternative embodiment of the present invention
only the differences from the preceding frames can be
transmitted. The ordered sequence of data packets at the
output of the parameter encoder is ready for storage or
transmission of the original speech signal.
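
The band layout described above can be sketched as follows. The example assumes that "average energy" means the mean squared harmonic amplitude in a band, and it picks the band counts and the 2 kHz split arbitrarily; `band_edges` and `band_energies` are hypothetical helper names.

```python
import numpy as np

def band_edges(cutoff_hz, bandwidth_hz, n_low, n_high):
    """Linear edges below cutoff_hz, logarithmic edges above it.

    cutoff_hz must be positive because np.geomspace cannot start at zero.
    """
    low = np.linspace(0.0, cutoff_hz, n_low + 1)
    high = np.geomspace(cutoff_hz, bandwidth_hz, n_high + 1)
    return np.concatenate([low, high[1:]])

def band_energies(harmonic_freqs, harmonic_amps, edges):
    """Mean squared harmonic amplitude per band (assumed energy measure)."""
    freqs = np.asarray(harmonic_freqs, dtype=float)
    power = np.asarray(harmonic_amps, dtype=float) ** 2
    n_bands = len(edges) - 1
    energies = np.zeros(n_bands)
    for b in range(n_bands):
        upper_ok = freqs <= edges[b + 1] if b == n_bands - 1 else freqs < edges[b + 1]
        mask = (freqs >= edges[b]) & upper_ok
        if mask.any():                 # bands without harmonics stay at zero
            energies[b] = power[mask].mean()
    return energies

# Harmonics of a 500 Hz pitch in a 4 kHz band, two linear and two log bands.
edges = band_edges(2000.0, 4000.0, n_low=2, n_high=2)
freqs = 500.0 * np.arange(1, 9)
amps = [1, 2, 2, 1, 1, 3, 3, 3]
energies = band_energies(freqs, amps, edges)
```

Because the number of bands is fixed, the number of energy values per frame does not depend on the pitch, unlike a scheme that transmits every harmonic magnitude.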

At the synthesis end, a decoder receives the ordered
sequence of data packets representing speech signal
segments. In a preferred embodiment, the unvoiced portion of
the excitation signal in each time segment is reconstructed
by selecting, dependent on the voicing probability Pv, a
codebook entry which comprises a high pass filtered noise
signal. The codebook entry signal is scaled by a factor
corresponding to the energy of the unvoiced portion of the
spectrum. To synthesize the voiced excitation signal, the
spectral magnitude envelope of the excitation signal is first
reconstructed by linearly interpolating between values
obtained from the transmitted spectrum band energy (or
energies). This envelope is sampled at the harmonic
frequencies of the pitch to obtain the amplitudes of sinusoids to
be used for synthesis. The voiced portion of the excitation
signal is finally synthesized from the computed harmonic
amplitudes using a harmonic synthesizer which provides
amplitude and phase continuity to the signal of the preceding
speech segment. The reconstructed voiced and unvoiced
portions of the excitation signal are combined to provide a
composite output excitation signal which is finally passed
through an LPC model filter to obtain a delayed version of
the input signal.
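
The voiced synthesis path above can be sketched in a few lines: interpolate an envelope between per-band values, sample it at the pitch harmonics, and sum sinusoids. The band centre frequencies, the 8 kHz sampling rate and both function names are assumptions for illustration; phase continuity across frames is represented only by the optional `phases` argument.

```python
import numpy as np

def harmonic_amplitudes(band_centers_hz, band_rms, f0_hz, cutoff_hz):
    """Sample a linearly interpolated envelope at the pitch harmonics."""
    n_harm = int(cutoff_hz // f0_hz)
    harmonics = np.arange(1, n_harm + 1) * f0_hz
    # np.interp holds the end values outside the first/last band centre.
    amps = np.interp(harmonics, band_centers_hz, band_rms)
    return harmonics, amps

def synthesize_voiced(harmonics_hz, amps, n_samples, fs_hz=8000.0, phases=None):
    """Sum of sinusoids; pass the previous frame's end phases for continuity."""
    n = np.arange(n_samples)
    if phases is None:
        phases = np.zeros(len(harmonics_hz))
    out = np.zeros(n_samples)
    for f, a, ph in zip(harmonics_hz, amps, phases):
        out += a * np.cos(2.0 * np.pi * f * n / fs_hz + ph)
    return out

harmonics, amps = harmonic_amplitudes([500.0, 1500.0], [1.0, 3.0],
                                      f0_hz=250.0, cutoff_hz=1000.0)
frame = synthesize_voiced(harmonics, amps, n_samples=16)
```

Only the pitch and a handful of band values are needed to rebuild every harmonic amplitude, which is what keeps the bit rate low.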

Several modifications to the basic algorithm described
above can be used to enhance the performance of the system.
For example, the frame by frame update of the LPC filter
coefficients can be adjusted to take into account the temporal
characteristics of the input speech signal.

Specifically, in order to model frame transitions more
accurately, the update rate of the analysis window can be
adjusted adaptively. In a specific embodiment, the
adjustment is done using frame interpolation of the transmitted
LSFs. Advantageously, the LSFs can be used to check the
stability of the corresponding LPC filter; in case the
resulting filter is unstable, the LSF coefficients are corrected to
provide a stable filter. This interpolation procedure has been
found to automatically track the formants and valleys of the
speech signal from one frame to another, as a result of which
the output speech is rendered considerably smoother and
with higher perceptual quality.
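
A minimal sketch of LSF interpolation and of a stability check is given below. It relies on the standard property that an LPC filter is stable when its LSFs are strictly increasing inside (0, pi); the correction step, which sorts the values and enforces a minimum gap, is a hypothetical choice because the text does not say how unstable sets are repaired.

```python
import numpy as np

def interpolate_lsf(lsf_prev, lsf_curr, n_sub):
    """Linearly interpolate LSF vectors over n_sub sub-frames."""
    lsf_prev = np.asarray(lsf_prev, dtype=float)
    lsf_curr = np.asarray(lsf_curr, dtype=float)
    weights = np.arange(1, n_sub + 1) / n_sub
    return np.array([(1.0 - w) * lsf_prev + w * lsf_curr for w in weights])

def is_stable(lsf):
    """True if the LSFs are strictly increasing and lie inside (0, pi)."""
    lsf = np.asarray(lsf, dtype=float)
    return bool(lsf[0] > 0.0 and lsf[-1] < np.pi and np.all(np.diff(lsf) > 0.0))

def stabilize(lsf, min_gap=0.01):
    """Assumed correction: sort, then enforce a minimum spacing (radians)."""
    out = np.sort(np.asarray(lsf, dtype=float))
    out[0] = max(out[0], min_gap)
    for i in range(1, len(out)):               # push crowded values upward
        out[i] = max(out[i], out[i - 1] + min_gap)
    out[-1] = min(out[-1], np.pi - min_gap)
    for i in range(len(out) - 2, -1, -1):      # pull back below the upper limit
        out[i] = min(out[i], out[i + 1] - min_gap)
    return out

steps = interpolate_lsf([0.2, 0.6], [0.4, 1.0], n_sub=2)
fixed = stabilize([0.5, 0.5, 1.0])
```

A weighted average of two ordered LSF sets is itself ordered, so interpolation between two stable frames does not need a separate correction.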
In addition, in accordance with a preferred embodiment of
the present invention a post-filter is used to further shape the
excitation noise signal and improve the perceptual quality of
the synthesized speech. The post-filter can also be used for
harmonic amplitude enhancement in the synthesis of the
voiced portion of the excitation signal.

Due to the separation of the input signal in different
portions, it is possible to use the method of the present
invention to develop different processing systems with
operating characteristics corresponding to user-specific
applications. Furthermore, the system of the present invention can
easily be modified to generate a number of voice effects with
applications in various communications and multimedia
products.

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention will next be described in detail by
reference to the following drawings in which:

FIG. 1 is a block diagram of the speech processing system 
of the present invention. 

FIG. 2 is a schematic block diagram of the encoder used 
in a preferred embodiment of the system of the present 
invention. 

FIG. 3 illustrates in a schematic block-diagram form the 
decoder used in a preferred embodiment of the present 
invention. 

FIG. 4 is a flow-chart of the pitch detection algorithm in 
accordance with a preferred embodiment of the present 
invention. 

FIG. 5 is a flow-chart of the voicing probability compu- 
tation algorithm of the present invention. 

FIG. 6 shows in a flow-chart form the computation of the 
parameters of the LPC model filter. 

FIG. 7 shows in a flow-chart form the operation of the 
frequency domain post-filter in accordance with the present 
invention. 

FIG. 8 illustrates a method of generating the voiced 
portion of the excitation signal in accordance with the 
present invention. 

FIG. 9 illustrates a method of generating the unvoiced 
portion of the excitation signal in accordance with the 
present invention. 

FIG. 10 illustrates the frequency domain characteristics of 
the post-filtering operation used in accordance with the 
present invention. 

DETAILED DESCRIPTION OF THE 
INVENTION 

During the course of the description like numbers will be
used to identify like elements shown in the figures. Bold face
letters represent vectors, while vector elements and scalar
coefficients are shown in standard print.

FIG. 1 is a block diagram of the speech processing system 
12 for encoding and decoding speech in accordance with the 
present invention. Analog input speech signal s(t) (15) from 
an arbitrary voice source is received at encoder 5 for 
subsequent storage or transmission over a communications 
channel 101. Encoder 5 digitizes the analog input speech 
signal 15, divides the digitized speech sequence into speech 
segments and encodes each segment into a data packet 25 of 
length I information bits. The ordered sequence of encoded 
speech data packets 25 which represent the continuous 
speech signal s(t) is transmitted over communications
channel 101 to decoder 8. Decoder 8 receives data packets 
25 in their original order to synthesize a digital speech signal
which is then passed to a digital-to-analog converter to
produce a time delayed analog speech signal 32, denoted
s(t-Tm), as explained in more detail next. The system of the
present invention is described next with reference to a
specific preferred embodiment which is directed to
processing of speech at very low bit rates.

A. The Encoder

FIG. 2 illustrates in greater detail the main elements of
encoder 5 and their interconnections in a preferred
embodiment of a speech coder. Not shown in FIG. 2, signal
pre-processing is first applied, as known in the art, to
facilitate encoding of the input speech. In particular, analog
input speech signal 15 is low pass filtered to eliminate
frequencies outside the human voice range. The low pass
filtered analog signal is then passed to an analog-to-digital
converter where it is sampled and quantized to generate a
digital signal s(n) suitable for subsequent processing.

As known in the art, digital signal s(n) is next divided into
frames of predetermined dimensions. In a specific
embodiment of the present invention operating at 2.4 kb/s rate 211
samples are used to form one speech frame. In order to
minimize signal distortions at the transitions between
adjacent frames a preset number of samples, in a specific
embodiment, about 60 samples from each frame overlap
with the adjacent frame. In a preferred embodiment, the
separation of the input signal into frames is accomplished
using a circular buffer, which is also used to set the lag
between different frames and other parameters of the
pre-processing stage of the system.

In accordance with a preferred embodiment of the present
invention, the spectrum $S(\omega)$ of the input speech signal in a
frame of a predetermined length is represented using a
speech production model in which speech is viewed as the
result of passing a substantially flat excitation spectrum
$E(\omega)$ through a linear time-varying filter $H(\omega, t)$, which
models the resonant characteristics of the speech spectral
envelope as:

$$S(\omega) = E(\omega)\,H(\omega, t) \qquad (1)$$

In accordance with a preferred embodiment of the present
invention the time-varying filter in Eq. (1) is assumed to be
an all-pole filter, preferably an LPC filter with a
predetermined number of coefficients. It has been found that for
practical purposes an LPC filter with 10 coefficients is
adequate to model the spectral shape of human speech
signals. On the other hand, in accordance with the present
invention the excitation spectrum $E(\omega)$ in Eq. (1) is specified
by a set of parameters including the signal pitch, the
excitation RMS values in one or more frequency bands, and
a voicing probability parameter Pv, as discussed in more
detail next.

More specifically, with reference to FIG. 2, the speech
production model parameters (LPC filter coefficients) are
estimated in LPC analysis block 20 in order to minimize the
mean squared error (MSE) between the original spectrum
$S_w(\omega)$ and the synthetic spectrum $\hat{S}(\omega)$. After computing the
coefficients of the LPC filter, the input signal is inverse
filtered in block 30 to subtract the synthetic spectrum from
the original signal spectrum, thus forming the excitation
spectrum $E(\omega)$. The parameters used in accordance with the
present invention to represent the excitation spectrum of the
signal are then estimated in excitation analysis block 40. As
shown in FIG. 2, these parameters include the pitch $P_0$ of the
signal, the voicing probability for the segment and one or
more spectrum band energy coefficients $E_k$. Thus, in
accordance with a preferred embodiment of the present invention
encoder 5 of the system outputs for storage and transmission
only a set of LPC coefficients (or the related LSFs),
representing the model spectrum for the signal, and the
parameters of the excitation signal estimated in analysis block 40.

A.1 Speech production model parameters

In accordance with a preferred embodiment of the present
invention the time-varying filter modeling the spectrum of
the signal is an LPC filter. The advantage of using an LPC
model for spectral envelope representation is to obtain a few
parameters that can be effectively quantized at low bit rates.
To determine these parameters, rather than minimizing the
residual energy in the time domain, the goal is to fit the
original speech spectrum $S_w(\omega)$ to an all-pole model $R(\omega)$
such that the error between the two is minimized. The
all-pole model can be written as

$$R(\omega) = \frac{G}{A(\omega)} = \frac{G}{1 + \sum_{k=1}^{p} a_k e^{-j\omega k}} \qquad (2)$$

where G is a gain factor, p is the number of poles in the
spectrum and $A(\omega)$ is known as the inverse LPC filter. The
MSE error $E_r$ between $S_w(\omega)$ and $R(\omega)$ is given by

$$E_r = \sum_{\omega=-N/2}^{N/2} \left| \frac{S_w(\omega)}{R(\omega)} \right|^2 = \frac{1}{G^2} \sum_{\omega=-N/2}^{N/2} \left| S_w(\omega)\,A(\omega) \right|^2 \qquad (3)$$

The parameters $\{a_k\}$ are then determined by minimizing the
error $E_r$ with respect to each $a_k$ parameter. As known in the
art, the solution to this minimization problem is given by the
following set of equations:

$$\sum_{k=1}^{p} a_k R_{|i-k|} = -R_i; \qquad 1 \le i \le p \qquad (4)$$

where

$$R_i = \sum_{\omega=-N/2}^{N/2} \left| S_w(\omega) \right|^2 \cos(i\omega) \qquad (5)$$

Equation (4) represents a set of p linear equations in p
unknowns which may be solved for $\{a_k\}$ using the
Levinson-Durbin algorithm, as shown in FIG. 6. This algorithm is well
known in the art and is described, for example, in S. J.
Orfanidis, "Optimum Signal Processing," McGraw Hill,
New York, 1988, pp. 202-207, which is hereby incorporated
by reference. In a preferred embodiment of the present
invention the number p of the preceding speech samples
used in the prediction is set equal to about 6 to 10. Similarly,
the gain parameter G can be calculated as:

$$G^2 = R_0 + \sum_{k=1}^{p} a_k R_k \qquad (6)$$

A.2 Excitation Model Parameters

As the LPC spectrum is a close estimate of the spectral
envelope of the speech spectrum, its removal is bound to
result in a relatively flat excitation signal. Notably, the
information content of the excitation signal is substantially
uniform over the spectrum of the signal, so that estimates of
the residual information contained in the spectrum are
generally more accurate compared to estimates obtained
directly from the original spectrum. As indicated above, the
residual information which is most important for the
purposes of optimally coding the excitation signal comprises
the pitch, the voicing probability and the excitation spectrum
energy parameters, each one being considered in more detail
next.
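
The spectral-domain LPC fit of section A.1 and the flattening of the spectrum it enables can be sketched as follows. The example computes the autocorrelation values of Eq. (5) from the frame's power spectrum on the DFT grid, solves the normal equations of Eq. (4) with a general linear solver for brevity, reads Eq. (6) as giving the squared gain (the usual LPC convention, assumed here), and divides the LPC envelope out of the spectrum. The 256-sample noise frame and all function names are illustrative, not taken from the patent.

```python
import numpy as np

def autocorr_from_power_spectrum(power, p):
    """R_i = sum over bins of |S|^2 * cos(i * w), cf. Eq. (5).

    The bins run over one full period (0..N-1), which is equivalent to the
    symmetric range -N/2..N/2 used in the text.
    """
    n = len(power)
    w = 2.0 * np.pi * np.arange(n) / n
    return np.array([np.sum(power * np.cos(i * w)) for i in range(p + 1)])

def lpc_from_spectrum(frame, p=10):
    """Return (spectrum, a, gain) with a[0] == 1 (hypothetical helper)."""
    spec = np.fft.fft(frame)
    r = autocorr_from_power_spectrum(np.abs(spec) ** 2, p)
    # Normal equations of Eq. (4), solved directly instead of recursively.
    toeplitz = np.array([[r[abs(i - k)] for k in range(p)] for i in range(p)])
    a = np.concatenate([[1.0], np.linalg.solve(toeplitz, -r[1:])])
    gain_sq = r[0] + np.dot(a[1:], r[1:])      # Eq. (6), read as G squared
    return spec, a, np.sqrt(gain_sq)

def excitation_spectrum(spec, a, gain):
    """E(w) = S(w) * A(w) / G, i.e. the spectrum with the envelope removed."""
    inverse_filter = np.fft.fft(a, n=len(spec))
    return spec * inverse_filter / gain

rng = np.random.default_rng(1)
frame = rng.standard_normal(256)               # stand-in for one speech frame
spec, a, gain = lpc_from_spectrum(frame, p=10)
excitation = excitation_spectrum(spec, a, gain)
```

With this normalization the excitation power summed over all bins equals one, so the pitch, voicing and band-energy analysis that follows sees a spectrum whose overall level no longer depends on the envelope.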
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Turning next to FIG. 4, it shows a flow-chart of the pitch
detection algorithm in accordance with a preferred embodiment of the
present invention. Pitch detection plays a critical role in most speech
coding applications, especially for low bit rate systems, because the
human ear is more sensitive to changes in the pitch than to changes in
other speech signal parameters by an order of magnitude. Typical
problems include mistaking submultiples of the pitch for its correct
value, in which case the synthesized output speech will have multiple
times the actual number of harmonics. The perceptual effect of making
such a mistake is having a male voice sound like female. Another
significant problem is ensuring smooth transitions between the pitch
estimates in a sequence of speech frames. If such transitions are not
smooth enough, the produced signal exhibits perceptually very
objectionable signal discontinuities. Therefore, due to the importance
of the pitch in any speech processing system, its estimation requires a
robust, accurate and reliable computation method. In accordance with a
preferred embodiment of the present invention the pitch detector used in
block 20 of the encoder 5 operates in the frequency domain.

Accordingly, with reference to FIG. 2, the first function of block 40 in
the encoder 5 is to compute the signal spectrum S(k) for a speech
segment, also known as the short time spectrum of a continuous signal,
and supply it to the pitch detector. The computation of the short time
signal spectrum is a process well known in the art and therefore will be
discussed only briefly in the context of the operation of encoder 5.

Specifically, it is known in the art that to avoid discontinuities of
the signal at the ends of speech segments and problems associated with
spectral leakage in the frequency domain, a signal vector y_M containing
samples of a speech segment should be multiplied by a pre-specified
window w to obtain a windowed speech vector y_WM. The specific window
used in the encoder 5 of the present invention is a Hamming or a Kaiser
window, the elements of which are scaled to meet the constraint:

$\frac{1}{M} \sum_{n=0}^{M-1} w^2(n) = 1$ (7)

The use of Kaiser and Hamming windows is described for example in
Oppenheim et al., "Discrete Time Signal Processing," Prentice Hall,
Englewood Cliffs, N.J., 1989. For a Kaiser window W_K the elements of
vector y_WM are given by the expression:

$y_{WM}(n) = W_K(n) \cdot y(n); \quad n = 0, 1, 2, \ldots, M-1$ (8)

The input windowed vector y_WM is next padded with zeros to generate a
vector y_N of length N defined as follows:

$y_N(n) = y_{WM}(n)$ for $n = 0, \ldots, M-1$
$y_N(n) = 0$ for $n = M, \ldots, N-1$ (9)

The zero padding operation is required in order to obtain an alias-free
version of the discrete Fourier transform (DFT) of the windowed speech
segment vector, and to obtain spectrum samples on a more finely divided
grid of frequencies. It can be appreciated that dependent on the desired
frequency separation, a different number of zeros may be appended to
windowed speech vector y_WM.

Following the zero padding, an N-point discrete Fourier transform of
speech vector y_N is performed to obtain the corresponding frequency
domain vector F_N. Preferably, the computation of the DFT is executed
using any fast Fourier transform (FFT) algorithm. As well known, the
efficiency of the FFT computation increases if the length N of the
transform is a power of 2, i.e. if N=2^L. Accordingly, in a specific
embodiment of the present invention the length N of the speech vector is
initially adjusted by adding zeros to meet this requirement.

A.2.1 Pitch Estimation

In accordance with a preferred embodiment of the present invention
estimation of the pitch generally involves a two-step process. In the
first step, the spectrum of the input signal sampled at the "pitch rate"
f_p is used to compute a rough estimate of the pitch F_0. In the second
step of the process the pitch estimate is refined using a spectrum of
the signal sampled at a higher regular sampling frequency f_s.
Preferably, the pitch estimates in a sequence of frames are also refined
using backward and forward tracking pitch smoothing algorithms which
correct errors for each pitch estimate on the basis of comparing it with
estimates in the adjacent frames. In addition, the voicing probability
Pv of the adjacent segments, discussed in more detail next, is also used
in a preferred embodiment of the invention to define the scope of the
search in the pitch tracking algorithm.

More specifically, with reference to FIG. 4, at step 200 of the method
an N-point FFT is performed on the signal sampled at the pitch sampling
frequency f_p. As discussed above, prior to the FFT computation the
input signal of length N is windowed using preferably a Kaiser window of
length N.

In the following step 210 the spectral magnitudes M and the total energy
E of the spectral components are computed in a frequency band in which
the pitch signal is normally expected. Typically, the upper limit of
this expectation band is assumed to be between about 1.5 and 2 kHz.
Next, in step 220 the magnitudes and locations of the spectral peaks
within the expectation band are determined by using a simple routine
which computes signal maxima. The estimated peak amplitudes and their
locations are designated as {A_i, w_i}, i = 1, ..., L, respectively,
where L is the number of peaks in the expectation band.

The search for the optimal pitch candidate among the peaks determined in
step 220 is performed in the following step 230. Conceptually, this
search can be thought of as defining for each pitch candidate a
comb-filter comprising the pitch candidate and a set of harmonically
related amplitudes. Next, the neighborhood around each harmonic of each
comb filter is searched for an optimal peak candidate. Specifically,
within a pre-specified search distance d around the harmonics of each
pitch candidate, the maxima of the actual speech signal spectrum are
checked to determine the optimum spectral peak. A suitable formula used
in accordance with the present invention to compute the optimum peak is
given by the expression:

(10)

where [...] is the weighted peak amplitude for the k-th harmonic; A_i is
the i-th peak amplitude and d(w_i, kw_0) is an appropriate distance
measure between the frequency of the i-th peak and the k-th harmonic
within the search distance. A number of functional expressions can be
used for the distance measure d(w_i, kw_0). Preferably, two distance
measures, the performance of which is very similar, can be used:

1. $d(w_i, kw_0) = \cos[2\pi(w_i - kw_0)]$ (11a)

2. $d(w_i, kw_0) = \frac{\sin[2\pi(w_i - kw_0)]}{2\pi(w_i - kw_0)}$ (11b)

In accordance with the present invention the determination of an
optimum peak depends both on the distance
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function d(w_i, kw_0) and the peak amplitudes within the
search distance. Therefore, it is conceivable that using such a
function an optimum can be found which does not correspond
to the minimum spectral separation between a pitch
candidate and the spectrum peaks.
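
The windowing and zero-padding steps that precede the spectrum computation in this section can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, a Hamming window is used in place of the preferred Kaiser window, and the scaling constraint is assumed to be unit mean-square, (1/M)·Σ w²(n) = 1.

```python
import math

def scaled_hamming(m):
    """Hamming window of length m, rescaled so that (1/m) * sum(w[n]**2) == 1.

    The unit mean-square scaling is an assumption made for this sketch.
    """
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (m - 1)) for n in range(m)]
    rms = math.sqrt(sum(x * x for x in w) / m)
    return [x / rms for x in w]

def next_power_of_two(n):
    """Smallest power of two that is >= n (the FFT-friendly length N)."""
    p = 1
    while p < n:
        p *= 2
    return p

def window_and_pad(segment):
    """Window an M-sample segment, then append zeros up to a power-of-two length."""
    m = len(segment)
    w = scaled_hamming(m)
    windowed = [wi * yi for wi, yi in zip(w, segment)]
    n = next_power_of_two(m)
    return windowed + [0.0] * (n - m)
```

For example, a 160-sample segment is padded to 256 samples before the N-point FFT is taken.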

Once all optimum peak amplitudes corresponding to each
harmonic of the pitch candidates are obtained, a normalized
cross-correlation function is computed between the
frequency response of each comb-filter and the determined
optimum peak amplitudes for a set of speech frames in
accordance with the expression:


(12)

where -2≤Fr≤3, and h_k are the harmonic amplitudes of the
teeth of the comb-filter, H is the number of harmonic
amplitudes, and n is a pitch lag which can vary. The second
term in the equation above is a bias factor, an energy ratio
between harmonic amplitudes and peak amplitudes, that
reduces the probability of encountering a pitch doubling
problem.



In a preferred embodiment of the present invention the
pitch of frame F_1 is estimated using backward and forward
pitch tracking to maximize the cross-correlation values from
one frame to another, which process is summarized as
follows: blocks 240 and 250 in FIG. 4 represent respectively
backward pitch tracking and lookahead pitch tracking which
can be used in accordance with a preferred embodiment
of the present invention to improve the perceptual quality of
the output speech signal. The principle of pitch tracking is
based on the continuity characteristic of the pitch, i.e. the
property of a speech signal that once a voiced signal is
established, its pitch varies only within a limited range.
(This property was used in establishing the search range for
the pitch in the next signal frame, as described above).
Generally, pitch tracking can be used either as an error
checking function following the main pitch determination
process, or as a part of this process which ensures that the
estimation follows a correct, smooth route, as determined by
the continuity of the pitch in a sequence of adjacent speech
segments.

In a specific embodiment of the present invention, the
pitch P_1 of frame F_1 is estimated using the following
procedure. Considering first the backward tracking
mechanism, in accordance with the pitch continuity
assumption, the pitch period is searched in a limited range
around the pitch value P_0 of the preceding frame F_0. This
condition is expressed mathematically as follows:

$(1-\alpha) \cdot P_0 \le P_1 \le (1+\alpha) \cdot P_0$

where α determines the range for the pitch search and is
typically set equal to 0.25. The cross-correlation function
R_1(P) for frame F_1, as defined in Eq. (12) above, is
considered at each value of P which falls within the defined pitch
range. Next, the values R_1(P) for all pitch candidates in the
range given above are compared and a backward pitch
estimate P_b is determined by maximizing the R_1(P) function
over all pitch candidates. The average cross-correlation
values for the backward frames are then computed using the
expression:



12 



I RiiPb) 



-(M-1) 
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where P_i and R_i(P_i) are the pitch estimates and corresponding
cross-correlation functions for the previous (M-1) frames,
respectively.
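
The backward tracking step can be illustrated with a short sketch. The names below are hypothetical, `cross_corr` stands in for the cross-correlation function R_1(P) of Eq. (12), and the search range is assumed to be (1-α)·P_prev to (1+α)·P_prev with α = 0.25.

```python
def backward_pitch(prev_pitch, cross_corr, candidates, alpha=0.25):
    """Pick the candidate near the previous pitch that maximizes cross_corr.

    prev_pitch : pitch of the preceding frame
    cross_corr : callable P -> score, a stand-in for R_1(P)
    candidates : iterable of pitch values to consider
    alpha      : relative half-width of the search range (assumed form)
    """
    lo = (1.0 - alpha) * prev_pitch
    hi = (1.0 + alpha) * prev_pitch
    in_range = [p for p in candidates if lo <= p <= hi]
    if not in_range:
        return None  # no candidate satisfies the continuity constraint
    return max(in_range, key=cross_corr)
```

With a previous pitch of 50 samples, only candidates between 37.5 and 62.5 are scored, so a strong correlation peak far outside that range cannot capture the estimate.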

Turning next to the forward tracking mechanism, it is
again assumed that the pitch varies smoothly between
frames. Since the pitch has not yet been determined for the
M-1 future frames, the forward pitch tracking algorithm
selects the optimum pitch for these frames. This is done by
first restricting the pitch search range, as shown above. Next,
assuming that P_1 is fixed, the values of the pitch in the future
frames {P_(i+1)}, i = 1, ..., M-1, are determined so as to
maximize the cross-correlation functions {R_(i+1)(P)} in the
range. Once the set of values {P_(i+1)} has been determined,
the forward average cross-correlation function C_f(P) is
calculated, as in the case of backward tracking, using the
expression:



55 



60 



cm- 



[{M-1) 
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(14) 



This process is repeated for each pitch candidate. The
corresponding values of C_f(P) are compared and the forward
pitch P_f is chosen which results in the maximum value of the
C_f(P) function. The maximum backward cross-correlation
C_b(P_b) is finally compared against the maximum forward
average cross-correlation and the larger value is used to
determine the optimum pitch P_1.



In an alternative embodiment of the present invention, the
search for the optimum pitch candidate uses the voicing
probability parameter Pv for the previous frame. (The voic- 
ing probability parameter is discussed in more detail in the 
following section). In particular, Pv is compared against a 
pre-specified threshold and if it is larger than the threshold, 
it is assumed that the previous frame was predominantly 
voiced. Because of the continuity characteristic of the pitch, 
it is assumed that its value in the present frame will remain 
close to the value of the pitch in the preceding frame. 
Accordingly, the pitch search range can be limited to a 
predefined neighborhood of its value in the previous frame, 
as described above. Alternatively, if the voicing probability 
Pv of the preceding frame is less than the defined threshold, 
it is assumed that the speech frame was predominantly 
unvoiced, so that the pitch period in the present frame can 
assume an arbitrary value. In this case, a full search for all 
potential pitch candidates is performed. 
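
The choice between a restricted and a full pitch search can be sketched as below. The threshold value, the full search range of 20 to 147 samples and the ±25% neighborhood are illustrative assumptions, not values given in this section.

```python
def pitch_search_range(prev_pitch, prev_pv, full_range=(20, 147),
                       pv_threshold=0.5, alpha=0.25):
    """Return (low, high) bounds for the pitch search in the current frame.

    prev_pv above the threshold means the previous frame is treated as
    predominantly voiced, so the search stays near prev_pitch; otherwise
    the full range is searched.
    """
    if prev_pv > pv_threshold:
        lo = max(full_range[0], (1.0 - alpha) * prev_pitch)
        hi = min(full_range[1], (1.0 + alpha) * prev_pitch)
        return (lo, hi)
    return full_range
```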

The mechanism for pitch tracking described above is 
related to a specific embodiment of the present invention. 
Alternate algorithms for pitch tracking are known in the 
prior art and will not be considered in detail. Useful
discussion of this topic can be found, for example, in A. M.
Kondoz, "Digital Speech: Coding for Low Bit Rate
Communication Systems," John Wiley & Sons, 1994, the
relevant portions of which are hereby incorporated by
reference for all purposes.

With reference to FIG. 4, finally, in step 260 a check is
made whether the estimated pitch is not in fact a submultiple
of the actual pitch.

A.2.2 Pitch Sub-Multiple Check

The sub-multiple check algorithm in accordance with the
present invention can be summarized as follows:

1. Integer sub-multiples of the estimated pitch are first
computed to generate the ordered list

$\left( \frac{P_1}{2}, \frac{P_1}{3}, \ldots, \frac{P_1}{n} \right)$

2. The average harmonic energy for each sub-multiple
candidate is computed using the expression:

$E_k = \frac{1}{L_k} \sum_{i=1}^{L_k} A^2(i \cdot w_k); \quad k = 1, 2, \ldots, n$ (15)

where L_k is the number of harmonics, A(i·w_k) are harmonic
magnitudes and

$w_k = \frac{2\pi k}{P_1}$

is the frequency of the k-th sub-multiple of the pitch. The ratio
between the energy of the smallest sub-multiple and the
energy of the first sub-multiple, P_1, is then calculated and is
compared with an adaptive threshold which varies for each
sub-multiple. If this ratio is larger than the predetermined
threshold, the sub-multiple candidate is selected as the
actual pitch. Otherwise, the next largest sub-multiple is
checked. This process is repeated until all sub-multiples
have been tested.

3. If none of the sub-multiples of the pitch satisfy the
condition in step 2, the ratio r given in the following
expression is computed:

(16)

The ratio r is then compared with another adaptive
threshold which varies for each sub-multiple. If r is larger
than the corresponding threshold, the sub-multiple is selected
as the actual pitch; otherwise, this process is iterated until all
sub-multiples are checked. If none of the sub-multiples of the
initial pitch satisfy the condition, then P_1 is selected as the
pitch estimate.

A.2.3 Pitch Smoothing

In accordance with a preferred embodiment of the present
invention the pitch is estimated at least one frame in
advance. Therefore, as indicated above, it is possible to use
pitch tracking algorithms to smooth the pitch P_0 of the
current frame by looking at the sequence of previous pitch
values (P_-2, P_-1) and the pitch value (P_1) for the first future
frame. In this case, if P_-2, P_-1 and P_1 vary smoothly
from one to another, any jump in the estimate of the pitch P_0
of the current frame away from the path established in the
other frames indicates the possibility of an error which may
be corrected by comparing the estimate P_0 to the stored pitch
values of the adjacent frames, and "smoothing" the function
which connects all pitch values. Such a pitch smoothing
procedure, which is known in the art, improves the
synthesized speech significantly.

While the pitch detection was described above with
reference to a specific preferred embodiment which operates
in the frequency domain, it should be noted that other pitch
detectors can be used in block 40 (FIG. 2) to estimate the
fundamental frequency of the signal in each segment.
Specifically, autocorrelation or average magnitude difference
function (AMDF) detectors that operate in the time
domain, or a hybrid detector that operates both in the time
and the frequency domain, can also be employed for that
purpose.

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A.2.4 Voicing Determination

Traditional speech processing algorithms classify each
speech frame either as purely voiced or unvoiced based on
some pre-specified fixed decision threshold. Recently, in
multiband excitation (MBE) vocoders, the speech spectrum
of the signal was modeled as a combination of both unvoiced
and voiced portions of the speech signal by dividing the
speech spectrum into a number of frequency bands and
making a binary voicing decision for each band. In practice,
however, this technique is inefficient because it requires a
large number of bits to represent the voicing information for
each band of the speech spectrum. Another disadvantage of
this multiband decision approach is that the voicing
determination is not always accurate, and voicing errors,
especially when made in low frequency bands, can result in
output signal buzziness and other artifacts which are
perceptually objectionable to listeners.

In accordance with the present invention, a new method
is proposed for representing voicing information efficiently.
Specifically, in a preferred embodiment of the method it is
assumed that the low frequency components of a speech
signal are predominantly voiced and the high frequency
components are predominantly unvoiced. The goal is then to
find a border frequency that separates the signal spectrum
into such predominantly low frequency components (voiced
speech) and predominantly high frequency components
[...] in accordance with a preferred
embodiment of the present invention the concept of voicing
probability Pv is introduced. The voicing probability Pv
generally reflects the amount of voiced and unvoiced
components in a speech signal. Thus, for a given signal frame
Pv=0 indicates that there are no voiced components in the
frame; Pv=1 indicates that there are no unvoiced speech
components; the case when Pv has a value between 0 and 1
reflects the more common situation in which a speech
segment is composed of a combination of both voiced and
unvoiced signal portions, the relative amounts of which are
expressed by the value of the voicing probability Pv.
Notably, unlike standard subband coding schemes in which
the signal is segmented in the frequency domain into bands
having fixed boundaries, in accordance with the present
invention the separation of the signal into voiced and
unvoiced spectrum portions is flexible and adaptively
adjusted for each signal segment.

With reference to FIG. 5, the determination of the voicing
probability, along with a refinement of the pitch estimate, is
accomplished as follows. In step 205 of the method, the
spectrum of the speech segment at the standard sampling
frequency f_s is computed using an N-point FFT. (It should be
noted that the pitch estimate can be computed either from the
input signal, or from the excitation signal on the output of
block 30 in FIG. 2.)

In the next block 270 the following method steps take
place. First, a set of pitch candidates are selected on a refined
spectrum grid about the initial pitch estimate. In a preferred
embodiment, about 10 different candidates are selected
within the frequency range P-1 to P+1 of the initial pitch
estimate P. The corresponding harmonic coefficients A_i for
each of the refined pitch candidates are determined next
from the signal spectrum S(k) and are stored. Next, a
synthetic speech spectrum is created about each pitch
candidate based on the assumption that the speech is purely
voiced. The synthetic speech spectrum S(w) can be
computed as:
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$\hat{S}(w) = \sum_{k=1}^{H} |S(kw_0)| \cdot \mathrm{sinc}(w - kw_0)$ (17)

where |S(kw_0)| is the original speech spectrum magnitude
sampled at the harmonics of the pitch F_0, H is the number
of harmonics and:

$\mathrm{sinc}(w - kw_0) = \frac{\sin[\pi(w - kw_0)]}{\pi(w - kw_0)}$

is a sinc function which is centered around each harmonic of
the fundamental frequency.

The original and synthetic excitation spectra corresponding
to each harmonic of the fundamental frequency are then
compared on a point-by-point basis and an error measure for
each value is computed and stored. Due to the fact that the
synthetic spectrum is generated on the assumption that the
speech is purely voiced, the normalized error will be
relatively small in frequency bins corresponding to voiced
harmonics, and relatively large in frequency bins
corresponding to unvoiced portions of the signal. Thus, in
accordance with the present invention the normalized error for
the frequency bin around each harmonic can be used to decide
whether the signal in a bin is predominantly voiced or
unvoiced. To this end, the normalized error for each harmonic
bin is compared to a frequency-dependent threshold.
The value of the threshold is determined in a way such that
a proper mix of voiced and unvoiced energy can be obtained.
The frequency-dependent, adaptive threshold can be
calculated using the following sequence of steps:

1. Compute the energy of a speech signal.

2. Compute the long term average speech signal energy
using the expression:

$Z_{avg}(n) = \frac{Z_0(n) + Z_{avg}(n-1)}{2.0}$ if [...]
$Z_{avg}(n) = \alpha \cdot Z_{avg}(n-1) + \beta \cdot Z_0(n)$ otherwise

where Z_0(n) is the energy of the speech signal.

3. Compute the threshold parameter using the expression:

$T = \frac{\gamma \cdot Z_{avg}(n) + Z_0(n)}{\mu \cdot Z_{avg}(n) + Z_0(n)}$

4. Compute the adaptive, frequency dependent threshold
function:

$T_a(w) = T \cdot [a \cdot w + b]$ (20)

where the parameters α, β, γ, μ, a and b are constants that can
be determined by subjective tests using a group of listeners
which can indicate a perceptually optimum ratio of voiced to
unvoiced energy. In this case, if the normalized error is less
than the value of the frequency dependent adaptive threshold
function, T_a(w), the corresponding frequency bin is then
determined to be voiced; otherwise it is treated as being
unvoiced.

In summary, in accordance with a preferred embodiment
of the present invention the spectrum of the signal for each
segment is divided into a number of frequency bins. The
number of bins corresponds to the integer number obtained by
computing the ratio between half the sampling frequency f_s
and the refined pitch for the segment estimated in block 270
in FIG. 5. Next, a synthetic speech signal is generated on the
basis of the assumption that the signal is completely voiced,
and the spectrum of the synthetic signal is compared to the
actual signal spectrum over all frequency bins. The error
between the actual and the synthetic spectra is computed and
stored for each bin and then compared to a frequency-dependent
adaptive threshold. Frequency bins in which the
error exceeds the threshold are determined to be unvoiced,
while bins in which the error is less than the threshold are
considered to be voiced.

Unlike prior art solutions in which each frequency bin is
processed on the basis of the voiced/unvoiced decision, in
accordance with a preferred embodiment of the present
invention the entire signal spectrum is separated into two
bands. It has been determined experimentally that usually
the low frequency band of the signal spectrum represents
voiced speech, while the high frequency band represents
unvoiced signal. This observation is used in the system of
the present invention to provide an approximate solution to
the problem of separating the signal into voiced and
unvoiced bands, in which the boundary between voiced and
unvoiced spectrum bands is determined by the ratio between
the number of voiced harmonics within the spectrum of the
signal and the total number of frequency harmonics, i.e.
using the expression:

$P_v = \frac{H_v}{H}$

where H_v is the number of voiced harmonics that are
estimated using the above procedure and H is the total
number of frequency harmonics for the entire speech
spectrum. Accordingly, the voicing cut-off frequency is then
computed as:

$w_c = P_v \cdot \pi$ (22)

which defines the border frequency that separates the
unvoiced and voiced portions of the speech spectrum. The
voicing probability Pv is supplied on output to block 280 in
FIG. 5. Finally, in block 290 in FIG. 5 the power spectrum of
the harmonics is computed.

A.2.5 Excitation Spectrum Band Energies

Dependent on the required bit rate for the overall system,
in accordance with the present invention two separate
methods can be used to encode the energy of the excitation
spectrum. In a first preferred embodiment directed to very
low bit rate systems, a single parameter corresponding to the
energy of the excitation spectrum is stored or transmitted.
Specifically, if the total energy of the excitation signal is
equal to E, where

[...]

and [...] it has been determined that L harmonics of the pitch
are present, [...] transmitted:

[...]

In an alternative preferred embodiment, in order to provide
more flexibility in coding the excitation spectral magnitude
information, the whole spectrum is divided into a
certain number of bands (between about 8 and 10) and the
average energy for each band is computed from the harmonic
magnitudes that fall in the corresponding band.
Preferably, frequency bands in the voiced portion of the
spectrum can be separated using linearly spaced frequencies
while bands that fall within the unvoiced portion of the
spectrum can be separated using logarithmically spaced
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frequencies. These band energies are then quantized and 
transmitted to the receiver side, where the spectral magni- 
tude envelope is reconstructed by linearly interpolating 
between the band energies. 
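
A minimal sketch of the receiver-side envelope reconstruction is given below, assuming each band energy is associated with a band center frequency and that values outside the first and last centers are held constant. Both assumptions, and the function name, are illustrative only.

```python
def interpolate_envelope(centers, energies, freqs):
    """Piecewise-linear interpolation of band energies at arbitrary frequencies.

    centers  : increasing band center frequencies (assumed representation)
    energies : decoded band energies, one per center
    freqs    : frequencies (e.g. pitch harmonics) at which to sample
    """
    out = []
    for f in freqs:
        if f <= centers[0]:
            out.append(energies[0])
        elif f >= centers[-1]:
            out.append(energies[-1])
        else:
            j = 1
            while centers[j] < f:
                j += 1
            t = (f - centers[j - 1]) / (centers[j] - centers[j - 1])
            out.append(energies[j - 1] + t * (energies[j] - energies[j - 1]))
    return out
```

Sampling the interpolated envelope at the harmonics of the pitch frequency yields the harmonic amplitudes used later in the voiced excitation synthesis.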

A.2.6 Quantization

In accordance with a preferred embodiment of the present
invention, output parameters from the encoding block 5 are
finally quantized for subsequent storage and/or transmission.
Several algorithms can be used to that end, as known in the
art. In a specific embodiment, the LPC coefficients
representing the model of the signal spectrum are first
transformed to line spectrum frequency (LSF) coefficients.
Generally, LSFs encode speech spectral information in the
frequency domain and have been found to be less sensitive to
quantization than the LPC coefficients. In addition, LSFs lend
themselves to frame-to-frame interpolation with smooth
spectral changes because of their close relationship with the
formant frequencies of the input signal. This feature of the
LSFs is used in the present invention to increase the overall
coding efficiency of the system because only the difference
between LSF coefficient values in adjacent frames needs to be
transmitted in each segment. The LSF transformation is known
in the art and will not be considered in detail here. For
additional information on the subject one can consult, for
example, Kondoz, "Digital Speech: Coding for Low Bit
Rate Communication Systems," John Wiley & Sons, 1994,
the relevant portions of which are hereby incorporated by
reference.

The quantized output LSF parameters are finally supplied
to an encoder to form part of a data packet representing the
speech segment for storage and transmission. In a specific
embodiment of the present invention directed to a 2.4 kb/s
system, 31 bits are used for the transmission of the model
spectrum parameters, 4 bits are used to encode the voicing
probability, 8 bits are used to represent the value for the
pitch, and about 5 bits can be used to encode the excitation
spectrum energy parameter.
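
The bit allocation above can be checked with simple arithmetic. The frame duration derived here is an inference from the stated numbers, not a figure given in this section, and the excitation energy is taken as exactly 5 bits although the text says "about 5".

```python
# Bits per frame for the 2.4 kb/s embodiment, as listed in the text.
bits = {
    "model spectrum (LSF)": 31,
    "voicing probability": 4,
    "pitch": 8,
    "excitation energy": 5,  # "about 5 bits" in the text; assumed exact here
}

total_bits = sum(bits.values())            # bits per frame
bit_rate = 2400                            # bits per second
frame_ms = 1000 * total_bits / bit_rate    # implied frame duration in ms
pv_levels = 2 ** bits["voicing probability"]  # distinct Pv values
```

Under these assumptions a frame carries 48 bits, which corresponds to a 20 ms update interval at 2.4 kb/s, and the 4-bit voicing probability distinguishes 16 levels, matching the 16 codebook entries described for the decoder.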

B. The Decoder 

FIG. 3 shows in a schematic block-diagram form the
decoder used in accordance with a preferred embodiment of
the present invention. As indicated in the figure, the voiced
portion of the excitation signal is generated in block 50; the
unvoiced portion of the excitation signal is generated
separately in block 60, both blocks receiving on input the voicing
probability Pv, the pitch P_0, and the excitation energy
parameter(s) E_k. The output signals from blocks 50 and 60
are added in adder 55 to provide a composite excitation
signal. On the other hand, the encoded model spectrum
parameters are used to initiate the LPC interpolation filter
70. Finally, frequency domain post-filtering block 80 and
LPC synthesis block 90 cooperate to reconstruct the
original input signal, as discussed in more detail next.

The operation of unvoiced excitation synthesis block 60 is
illustrated in FIG. 9 and can briefly be described as taking
the short time Fourier transform (STFT) of a white noise
sequence and zeroing out the frequency regions marked in
accordance with the voicing probability parameter Pv as
being voiced. The synthetic unvoiced excitation can then be
produced from an inverse STFT using a weighted overlap-add
method. The samples of the unvoiced excitation signal
are then normalized to have the desired energy level a. With
reference to FIG. 9, a white Gaussian noise sequence is
generated in block 630 and is transformed in the frequency
domain in FFT block 620. The output from block 620 is then
used, in high pass filter 610, to synthesize the unvoiced part
of excitation on the basis of the voicing probability of the
signal. Since the voiced portion of speech spectrum (low
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frequencies) is processed by another algorithm, a high pass
filter in frequency domain is used to simply zero out the
voiced components of the spectrum.

Next, in block 640, the frequency components which fall 
above the voicing cut-off frequency are normalized to their 
corresponding band energies. Specifically, with reference to 
the single excitation energy parameter example considered
above, the normalization p is computed from the transmitted 
excitation energy A, the total number of harmonics L, as 
determined by the pitch, and the number of voiced harmon- 
ics Lv, determined from the voicing probability Pv, as 
follows: 

where En is the energy of the noise sequence at the output 
of block 630. 

The normalized noise sequence is next inverse Fourier 
transformed in block 650 to obtain a time-domain signal. In 
order to eliminate discontinuities at the frame edges, the 
synthesis window size is generally selected to be longer than 
the speech update size. As a result, the unvoiced excitation 
for each frame overlaps that of neighboring frames which 
eliminates the discontinuity at the frame boundaries. A 
weighted overlap-add procedure is therefore used in block 
660 to process the unvoiced part of the excitation signal. 
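
The frequency-domain high-pass step of the unvoiced synthesis can be sketched as follows. This is an illustrative sketch only: it uses a naive DFT instead of an FFT, omits the weighted overlap-add stage, and scales the output to a caller-supplied `target_energy` because the normalization formula is not reproduced in this section. All names are hypothetical.

```python
import cmath
import math
import random

def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def unvoiced_excitation(n, pv, target_energy, seed=0):
    """White Gaussian noise with the band below the voicing cut-off zeroed."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    spec = dft(noise)
    cutoff_bin = int(round(pv * n / 2))  # cut-off at Pv times the Nyquist bin
    for k in range(n):
        if min(k, n - k) < cutoff_bin:   # zero both halves so the output stays real
            spec[k] = 0.0
    out = idft(spec)
    energy = sum(v * v for v in out)
    if energy == 0.0:
        return out
    gain = math.sqrt(target_energy / energy)
    return [gain * v for v in out]
```

In a full decoder, successive frames produced this way would be windowed and overlap-added as described for block 660.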

In a preferred embodiment of the present invention, 
blocks 630, 620 and 610 can be combined in a single
memory block (not shown) which stores a set of pre-filtered
noise sequences. In particular, stored as codebook entries are
several pre-computed noise sequences which represent a
time-domain signal that corresponds to different "unvoiced"
portions of the spectrum of a speech signal. In a specific
embodiment of the present invention, 16 different entries can
be used to represent a whole range of unvoiced excitation
signals which correspond to such 16 different voicing
probabilities. For simplicity it is assumed that the spectrum of the
original signal is divided into 16 equal-width portions which
correspond to those 16 voicing probabilities. Other 
divisions, such as a logarithmic frequency division in one or 
more parts of the signal spectrum, can also be used and are 
determined on the basis of computational complexity con- 
siderations or some subjective performance measure for the 
system. 

FIG. 8 is a block diagram of the voiced excitation 
synthesis algorithm in accordance with a preferred embodi- 
ment of the present invention. As shown, block 550 receives 
on input the pitch, the voicing probability Pv, and the 
excitation band energies. The voiced excitation is repre- 
sented using a set of sinusoids harmonically related to the 
pitch. In a specific embodiment of the present invention in 
which only the total energy of the excitation signal has been 
transmitted, the amplitudes of all harmonic frequencies are 
assumed to be equal. Conditions for amplitude and phase 
continuity at the boundaries between adjacent frames can be 
computed, as shown for example in copending U.S. patent 
application Ser. No. 08/273,069 to one of the co-inventors of 
the present application. The content of this application is 
hereby expressly incorporated for all purposes. 

In an alternative embodiment of the present invention 
directed to the general case when more than one excitation 
band energies are transmitted, the voiced excitation is rep- 
resented as a sum of harmonic sinusoids of the pitch as: 



03/12/2004, EAST version: 1.4.1 



5,85 

19 

L 

where a(t) is the interpolated average harmonic excitation 
energy ftinction and is the phase function of the 

excitation harmonics. The harmonic amplitudes are obtained 
by linearly interpolating the band energies and sampling the 
interpolated energies at the harmonics of the pitch fre- 
quency. Furthermore, the excitation energy function is lin- 
early interpolated between frames, with the harmonics cor- 
responding to the unvoiced portion of the spectrum being set 
to zero. The phase function of the speech signal is deter- 
mined by the initial phase which is completely predicted 
using previous frame information and linear frequency track 
$\omega_k(t)$. To determine the phase of the excitation signal, the
phases of the speech signal and the LPC inverse filter are 
added together to form the excitation phase as: 

$$\psi_k(t) = \theta_k(t) + \Phi_k(t)$$

where $\Phi_k(t)$ is the phase of the LPC inverse filter corresponding
to the k-th frequency track at time t. As the phase function
$\theta_k(t)$ is dependent on the initial phase $\phi_0$ and the frequency
deviation $\Delta\omega_k$, the parameters $\phi_0$ and $\Delta\omega_k$ are chosen so that
the principal values of $\theta_k(0)$ and $\theta_k(-N)$ are equal to the
predicted harmonic phases in the current and the previous
frame, respectively.
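
The amplitude rule in the paragraph above (band energies linearly interpolated, sampled at the pitch harmonics, and set to zero in the unvoiced portion of the spectrum) can be sketched in Python as follows. This is only an illustrative sketch: the function names, the 8 kHz sampling rate, the placement of each band energy at the centre of its band, and the use of Pv times the Nyquist frequency as the voiced cutoff are assumptions. Any scaling from energy to amplitude is left out.

```python
def _interp(x, xs, ys):
    """Piecewise-linear interpolation with flat extrapolation at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            w = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + w * (ys[i] - ys[i - 1])


def sample_band_energies_at_harmonics(band_energies, f0_hz, pv,
                                      sample_rate_hz=8000.0):
    """Return one value per pitch harmonic below the Nyquist frequency."""
    nyquist_hz = sample_rate_hz / 2.0
    band_width = nyquist_hz / len(band_energies)
    # Assumption: each band energy sits at the centre of its band.
    centres = [(b + 0.5) * band_width for b in range(len(band_energies))]
    cutoff_hz = pv * nyquist_hz  # assumed voiced/unvoiced boundary
    values = []
    k = 1
    while k * f0_hz < nyquist_hz:
        f = k * f0_hz
        if f > cutoff_hz:
            values.append(0.0)  # unvoiced harmonic: set to zero
        else:
            values.append(_interp(f, centres, band_energies))
        k += 1
    return values
```

For example, with two band energies `[1.0, 3.0]`, a 500 Hz pitch and Pv = 0.5, the harmonics up to 2000 Hz receive interpolated values and the remaining harmonics are zero.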

When the k-th harmonics of the current and previous frames both fall
within the voiced portion of the spectrum, the initial phase
$\phi_0$ is set to the predicted phase of the current frame and $\Delta\omega_k$ is
chosen to be the smallest frequency deviation required to
match the phase of the previous frame. When either of the
corresponding harmonics in two adjacent frames is declared
unvoiced, only the initial phase parameter is required to
match the phase function $\theta_k(t)$ with the phase of the voiced
harmonic ($\Delta\omega_k$ is set to zero). When corresponding harmon-
ics in adjacent frames both fall within the unvoiced portion
of the spectrum, the function $a_k(t)$ is set to zero over the
entire interval between frames, so that a random phase
function can be used. Large differences in fundamental
frequency can occur between adjacent frames due to word 
boundaries and other effects. In these cases, linear interpo- 
lation of the fundamental frequency between frames is a 
poor model of the pitch variation, and can lead to artifacts 
in the synthesized signal. Consequently, when pitch fre- 
quency changes of more than about 10% are encountered 
between adjacent frames, the harmonics in the voiced por- 
tion of the spectrum for the current frame and the corre- 
sponding harmonics in the previous frame are treated as if 
followed and preceded, respectively, by unvoiced harmon- 
ics. 
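
The case analysis above can be summarized in a short Python sketch that classifies how a single harmonic track is joined across a frame boundary. The function name and the returned labels are hypothetical, and measuring the 10% pitch change relative to the previous frame is an assumption; the text only says "more than about 10%".

```python
def boundary_mode(prev_voiced, curr_voiced, prev_f0_hz, curr_f0_hz,
                  max_relative_change=0.10):
    """Classify the phase-matching rule for one harmonic across a boundary."""
    if not prev_voiced and not curr_voiced:
        # Amplitude is zero over the interval, so the phase is unconstrained.
        return "random_phase"
    # Assumption: the pitch change is measured relative to the previous frame.
    pitch_jump = abs(curr_f0_hz - prev_f0_hz) > max_relative_change * prev_f0_hz
    if prev_voiced and curr_voiced and not pitch_jump:
        # Initial phase from the current frame plus the smallest frequency
        # deviation that also matches the previous frame's phase.
        return "match_both_frames"
    # One side unvoiced, or a large pitch jump treated as such:
    # only the initial phase is matched and the deviation is zero.
    return "match_voiced_frame_only"
```

Treating a large pitch jump like a voiced/unvoiced transition avoids interpolating the fundamental frequency across the boundary, which the text identifies as a source of artifacts.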

C. Speech Enhancement 

Several techniques, including LPC interpolation and fre-
quency domain post-filtering, have been developed to
improve the subjective output speech quality of the speech
coder in accordance with a preferred embodiment of the
present invention. 

C.1 LPC Interpolation

In addition to the order p of the LPC analysis used, as
known in the art, the frame by frame update of the LPC
analysis coefficients determines the degree of accuracy with
which the LPC filter can model the spectrum of the speech
signal. Thus for example, during sustained regions of slowly
changing spectral characteristics, the frame by frame update
can cope reasonably well. However, in transition regions,
which are believed to be perceptually more important, it will
fail, as transitions fall within a single frame and thus cannot
be represented accurately. During such transition intervals,
the calculated set of parameters will only represent an
average of the changing shape of the spectral characteristics
of that speech frame. To model the transitions more
accurately, in accordance with a preferred embodiment of
the present invention, the update rate of the analysis is
increased so that the frame length is much larger than the
number of new samples used per frame, i.e. the window is
spread across past, current and future samples.

As those skilled in the art will appreciate, the disadvan-
tages of this technique are that greater algorithmic delay is
introduced; if the shift of the window (i.e. the number of new
samples used per update) is small, the coding capacity is
increased; and if the shift of the window is long, although the
coding capacity is decreased, the accuracy of the excitation
modelling also decreases. Therefore, a trade-off is required
between accurate spectral modelling, excitation modelling,
delay and coding efficiency. In accordance with a preferred
embodiment, one approach to satisfying this trade-off is the
use of frame-to-frame LPC interpolation. Generally, the idea
is to achieve an improved spectrum representation by evalu-
ating intermediate sets of parameters between frames, so that
transitions are introduced more smoothly at the frame edges
without the need to increase the coding capacity. The
interpolation type can be either linear or nonlinear.

As the LPC coefficients in accordance with the present
invention are quantized in the form of LSFs, it is preferable
to linearly interpolate the LSF coefficients across the frame
using the previous and current frame LSF coefficients.
Specifically, if the time between two speech frames corre-
sponds to N samples, the LSF interpolation function is given
by

$$\mathrm{LSF}_n(k) = \mathrm{lsf}_{M-1}(k) + \frac{n}{N}\left[\mathrm{lsf}_{M}(k) - \mathrm{lsf}_{M-1}(k)\right]$$

where $\mathrm{lsf}_{M}(k)$ corresponds to the k-th LSF coefficient in the
M-th frame and $0 \le n < N$. The interpolated LSFs are then con-
verted to LPC coefficients, which are used in the LPC
synthesis filter. This interpolation procedure automatically
tracks the formants and valleys from one formant to another,
which makes the output speech smoother. It was found that
the improvement due to the LPC interpolation is in all cases
very noticeable. The smoothness of the processed speech
was considerably enhanced, while speech from faster speak-
ers was noticeably improved. However, sample-by-sample
LPC interpolation is computationally very expensive.
Therefore, the speech frame is broken into five or six
subframes and the interpolation is evaluated only at the
center of each, requiring five or six interpolation points per
frame. This reduces the computational complexity
of the algorithm considerably, while producing almost iden-
tical speech quality.
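
A minimal Python sketch of the subframe interpolation described above is given below. It evaluates the linear interpolation between the previous-frame and current-frame LSF vectors once per subframe, at the subframe centre, rather than for every sample. The function name and argument names are illustrative, and the conversion of the interpolated LSFs back to LPC coefficients is not shown.

```python
def interpolate_lsf(lsf_prev, lsf_curr, frame_len, num_subframes=5):
    """Return one interpolated LSF vector per subframe.

    lsf_prev, lsf_curr: LSF vectors of the previous and current frame.
    frame_len: number of samples N between the two frames.
    """
    if len(lsf_prev) != len(lsf_curr):
        raise ValueError("LSF vectors must have the same length")
    sub_len = frame_len / num_subframes
    interpolated = []
    for s in range(num_subframes):
        n = (s + 0.5) * sub_len   # sample index at the centre of subframe s
        w = n / frame_len         # interpolation weight n / N
        interpolated.append(
            [p + w * (c - p) for p, c in zip(lsf_prev, lsf_curr)]
        )
    return interpolated
```

With five subframes the weights are 0.1, 0.3, 0.5, 0.7 and 0.9, so the first subframe stays close to the previous frame and the last is close to the current one.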

C.2 Frequency Domain Post-Filtering

Referring back to FIG. 3, in accordance with a preferred
embodiment of the present invention a post-filter 80 is used
to shape the noise and improve the perceptual quality of the
synthesized speech. Generally, in noise shaping, lowering
noise components at certain frequencies can only be
achieved at the price of increased noise components at other
frequencies. As speech formants are much more important to
perception than the formant nulls, the idea is to preserve the
formant information by keeping the noise in the formant
regions as low as possible. The first step in the design of the
frequency domain postfilter is to weight the measured spec-
tral envelope

$$R_\omega(\omega) = H(\omega)\,W(\omega)$$

in order to remove the spectral tilt and produce an even, i.e.,
more flat spectrum. In the expression above, $H(\omega)$ is the
measured spectral envelope (see FIG. 10A) and $W(\omega)$ is the
weighting function, represented as

$$W(\omega) = 1 + \sum_{k=1}^{p} a_k \gamma^k e^{-j\omega k}$$

where the coefficient $\gamma$ is between 0 and 1, and the frequency
response $H(\omega)$ of the LPC filter can be computed as:

$$H(\omega) = \frac{1}{1 + \sum_{k=1}^{p} a_k e^{-j\omega k}}$$

where $a_k$ is the k-th coefficient of a p-th order all-pole LPC filter
and $\gamma$ is the weighting coefficient, which is typically 0.5. See
FIG. 7. The weighted spectral envelope $R_\omega(\omega)$ is then
normalized to have unity gain, and taken to the power of $\beta$,
which is preferably set equal to 0.2. If $R_{max}$ is the maximum
value of the weighted spectral envelope, the postfilter is
taken to be

$$P_f(\omega) = \left(\frac{R_\omega(\omega)}{R_{max}}\right)^{\beta}, \quad 0 \le \beta \le 1.$$

The idea is that, at the formant peaks, the normalized
weighted spectral envelope will have unity gain and will not
be altered by the effect of $\beta$. This will be true even if the
low-frequency formants are significantly higher than those
at the high-frequency end. The value of the parameter $\beta$
controls the distance between formant peaks and nulls, so
that, overall, a Wiener-type filter characteristic will result
(see FIG. 10B). The estimated postfilter frequency response
is then used to weight the original speech envelope to give

$$\hat{H}(\omega) = P_f(\omega)\,H(\omega)$$

This causes the formants to narrow and reduces the depth of
the formant nulls, thereby reducing the effects of the noise
without introducing a spectral tilt in the spectrum, which is
very common in pole-zero postfilters (see FIG. 10C). When
applied to the decoder part of the system in accordance with
the present invention, it has been observed that the resulting
system produces much improved speech quality. The post-
filtering steps used in accordance with a specific embodi-
ment of the present invention are illustrated in FIG. 7.

C.3 Synthesizing the Final Speech Output

With reference to FIG. 3, after synthesizing the LPC
excitation signal on the output of block 55, and applying the
enhancement techniques discussed above on the synthesized
LPC excitation, LPC synthesis filtering is performed using
the interpolated LPC parameters by passing the excitation
through the LPC filter 90 to obtain the final synthesized
speech signal.

Decoder block 8 has been described with reference to a
specific preferred embodiment of the system of the present
invention. As discussed in more detail in Section A above,
however, the system of this invention is modular in the sense
that different blocks can be used for encoding of the voiced
and unvoiced portions of the signal dependent on the appli-
cation and other user-specified criteria. Accordingly, for
each specific embodiment of the encoder of the system,
corresponding changes need to be made in the decoder 8 of
the system for synthesizing output speech having desired
quantitative and perceptual characteristics. Such modifica-
tions should be apparent to a person skilled in the art and will
not be discussed in further detail.

D. Applications

The method and system of the present invention described
above in a preferred embodiment using 2.4 kb/s can in fact
provide the capability of accurately encoding and synthe-
sizing speech signals for a range of user-specific applica-
tions. Because of the modular structure of the system, in
which different portions of the signal spectrum can be
processed separately using different suitably optimized
algorithms, the encoder and decoder blocks can be modified
to accommodate specific user needs, such as different system
bit rates, by using different signal processing modules.
Furthermore, in addition to straight speech coding, the
analysis and synthesis blocks of the system of the present
invention can also be used in speech enhancement, recog-
nition and in the generation of voice effects. Furthermore,
the system and method of the present invention,
which are based on voicing probability determination, pro-
vide natural sounding speech which can be used in artificial
synthesis of a user's voice.

The method and system of the present invention may also
be used to generate a variety of sound effects. Two different
types of voice effects are considered next in more detail for
illustrative purposes. The first voice effect is what is known
in the art as time stretching. This type of sound effect may
be created if the decoder block uses synthesis frame sizes
different from those of the encoder. In such case, the synthe-
sized time segments are expanded or contracted in time
compared to the originals, changing the rate of playback. In
the system of the present invention this effect can easily be
accomplished simply by using, in the decoder block 8,
different values for the frame length N and the overlap
portion between adjacent frames. Experimentally it has been
demonstrated that the output signal of the present system can
be effectively changed with virtually no perceptual degra-
dation by a factor of about five in each direction (expansion
or contraction). Thus, the system of the present invention is
capable of providing a naturally sounding speech signal over
a range of applications including dictation, voice scanning,
and others. (Notably, the perceptual quality of the signal is
preserved because the fundamental frequency F0 and the
general position of the speech formants in the spectrum of
the signal are preserved.)

In addition, changing the pitch frequency F0 and the
harmonic amplitudes in the decoder block will have the
perceptual effect of altering the voice personality in the
synthesized speech with no other modifications of the sys-
tem being required. Thus, in some applications, while retain-
ing comparable levels of intelligibility of the synthesized
speech, the decoder block of the present invention may be
used to generate different voice personalities. Specifically, in
a preferred embodiment, the system of the present invention
is capable of generating a signal in which the pitch corre-
sponds to a predetermined target value F0T. A simple mecha-
nism by which this voice effect can be accomplished can be
described briefly as follows. Suppose for example that the
spectrum envelope $S(\omega)$ of an actual speech signal and the
fundamental frequency F0 and its harmonics have given
values. Using the system of the present invention the model
spectrum $S(\omega)$ can be generated from the reconstructed
output signal. (Notably, the pitch period and its harmonic
frequencies are directly available as encoding parameters.)
Next, the continuous spectrum $S(\omega)$ can be re-sampled to
generate the spectrum amplitudes at the target fundamental
frequency F0T and its harmonics. In an approximation, such
re-sampling, in accordance with a preferred embodiment of
the present invention, can easily be computed using linear
interpolation between the amplitudes of adjacent harmonics.
Next, at the synthesis block, instead of using the originally
received pitch F0 and the amplitudes of its harmonics, one
can use the target values obtained by interpolation, as
indicated above. This pitch shifting operation has been
shown in real time experiments to provide perceptually very
good results. Furthermore, the system of the present inven-
tion can also be used to dynamically change the pitch of the
reconstructed signal in accordance with a sequence of target
pitch values, each target value corresponding to a specified
number of speech frames. The sequence of target values for
the pitch can be pre-programmed for generation of a specific
voice effect, or can be interactively changed in real time by
the user.
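
The re-sampling step of the pitch shifting mechanism described above can be sketched in Python as follows. Given the amplitudes at the harmonics of the received pitch, the sketch linearly interpolates between adjacent harmonics to obtain amplitudes at the harmonics of a target pitch. The function name is hypothetical, and two boundary choices are assumptions of the example: the first amplitude is held below the first source harmonic, and target harmonics above the last source harmonic are dropped.

```python
import math


def resample_harmonics(amplitudes, f0_hz, target_f0_hz):
    """Interpolate harmonic amplitudes of f0_hz onto harmonics of target_f0_hz."""
    if not amplitudes:
        raise ValueError("amplitudes must not be empty")
    if f0_hz <= 0 or target_f0_hz <= 0:
        raise ValueError("pitch values must be positive")
    top_hz = len(amplitudes) * f0_hz  # frequency of the last source harmonic
    resampled = []
    k = 1
    while k * target_f0_hz <= top_hz:
        # Position of the target harmonic in units of source harmonics.
        position = k * target_f0_hz / f0_hz
        lower = math.floor(position)
        if lower < 1:
            # Assumption: hold the first amplitude below the first harmonic.
            resampled.append(amplitudes[0])
        elif lower >= len(amplitudes):
            resampled.append(amplitudes[-1])
        else:
            weight = position - lower
            a_lo = amplitudes[lower - 1]
            a_hi = amplitudes[lower]
            resampled.append(a_lo + weight * (a_hi - a_lo))
        k += 1
    return resampled
```

Because only the sampling positions change, the overall shape of the spectral envelope, and with it the general position of the formants, is kept while the pitch moves to the target value.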

It should further be noted that while the method and
system of the present invention have been described in the
context of a specific speech processing environment, they
are also applicable in the more general context of audio
processing. Thus, the input signal of the system may include
music, industrial sounds and others. In such case, dependent
on the application, it may be necessary to use a sampling
frequency higher or lower than the one used for speech, and
also adjust the parameters of the filters in order to adequately
represent all relevant aspects of the input signal.
Furthermore, harmonic amplitudes corresponding to differ-
ent tones of a musical instrument can also be stored at the
decoder of the system and used independently for music
synthesis. Compared to conventional methods, music syn-
thesis in accordance with the method of the present inven-
tion has the benefit of using significantly less memory space
as well as more accurately representing the perceptual
spectral content of the audio signal.

The low bit rate system of the present invention can be
used in a variety of
other applications, including computer and multimedia
games, transmission of documents with voice signatures
attached, Internet browsing, and others, where it is important
to keep the bit rate of the system relatively low, while the
quality of the output speech patterns need not be very high.
Other applications of the system and method of the present
invention will be apparent to those skilled in the art.

While the invention has been described with reference to
a preferred embodiment, it will be appreciated by those of
ordinary skill in the art that modifications can be made to the
structure and form of the invention without departing from
its spirit and scope, which is defined in the following claims.
An alternative description of the system and method of the
present invention which can assist the reader in understand-
ing specific aspects of the invention is attached.

What is claimed is: 

1. A method for processing an audio signal comprising:
dividing the signal into segments, each segment repre-
senting one of a succession of time intervals;
computing for each segment a model of the signal in such
segment;
subtracting the computed model from the original signal
to obtain a residual excitation signal;
detecting for each segment the presence of a fundamental
frequency F0;
determining for the excitation signal in each segment a
ratio between voiced and unvoiced components of the
signal in such segment on the basis of the fundamental
frequency F0, said ratio being defined as a voicing
probability Pv;
separating the excitation signal in each segment into a
voiced portion and an unvoiced portion on the basis of
the voicing probability Pv; and
encoding parameters of the model of the signal in each
segment and the voiced portion and the unvoiced
portion of the excitation signal in each segment in
separate data paths.

2. The method of claim 1 wherein the audio signal is a
speech signal and detecting the presence of a fundamental
frequency F0 comprises computing the spectrum of the
signal in a segment.

3. The method of claim 2 wherein the voiced portion of 
the signal occupies the low end of the spectrum and the 
unvoiced portion of the signal occupies the high end of the 
spectrum for each segment. 

4. The method of claim 1 wherein computing a model 
comprises modeling the spectrum of the signal in each 
segment as the output of a linear time-varying filter. 

5. The method of claim 4 wherein modeling the spectrum
of the signal in each segment comprises computing a set of
linear predictive coding (LPC) coefficients and encoding
parameters of the model of the signal comprises encoding
the computed LPC coefficients.

6. The method of claim 5 wherein encoding the LPC
coefficients comprises computing line spectral frequencies
(LSF) coefficients corresponding to the LPC coefficients and
encoding of the computed LSF coefficients for subsequent
storage and transmission.

7. The method of claim 1 further comprising: forming one
or more data packets corresponding to each segment for
subsequent transmission or storage, the one or more data
packets comprising: the fundamental frequency F0, data
representative of the computed model of the signal, and the
voicing probability Pv for the signal.

8. The method of claim 7 further comprising: receiving
the one or more data packets; and synthesizing audio signals
from the received one or more data packets.

9. The method of claim 8 wherein synthesizing audio
signals comprises:
decoding the received one or more data packets to extract:
the fundamental frequency, the data representative of
the computed model of the signal and the voicing
probability Pv for the signal.

10. The method of claim 9 further comprising:
synthesizing an audio signal from the extracted data,
wherein the low frequency band of the spectrum of said
synthesized audio signal is synthesized using data
representative of the voiced portion of the signal; the
high frequency band of the spectrum of said synthe-
sized audio signal is synthesized using data represen-
tative of the unvoiced portion of the signal and the
boundary between the low frequency band and the high
frequency band of the spectrum is determined on the
basis of the decoded voicing probability Pv.

11. The method of claim 10 wherein the audio signal
being synthesized is a speech signal and synthesizing
further comprises:
providing amplitude and phase continuity on the bound-
ary between adjacent synthesized speech segments.

12. A system for processing an audio signal comprising:
means for dividing the signal into segments, each segment
representing one of a succession of time intervals;
means for computing for each segment a model of the
signal in such segment;
means for subtracting the computed model from the
original signal to obtain a residual excitation signal;
means for detecting for each segment the presence of a
fundamental frequency F0;
means for determining for the excitation signal in each
segment a ratio between voiced and unvoiced compo-
nents of the signal in such segment on the basis of the
fundamental frequency F0, said ratio being defined as a
voicing probability Pv;
means for separating the excitation signal in each segment
into a voiced portion and an unvoiced portion on the
basis of the voicing probability Pv; and
means for encoding parameters of the model of the signal
in each segment and the voiced portion and the
unvoiced portion of the excitation signal in each seg-
ment in separate data paths.

13. The system of claim 12 wherein the audio signal is a
speech signal and the means for detecting the presence of a
fundamental frequency F0 comprises means for computing
the spectrum of the signal.

14. The system of claim 13 further comprising: means for
computing LPC coefficients for a signal segment; and
means for transforming LPC coefficients into line spectral
frequencies (LSF) coefficients corresponding to the
LPC coefficients.

15. The system of claim 12 wherein said means for
determining a ratio between voiced and unvoiced compo-
nents further comprises:
means for generating a fully voiced synthetic spectrum of
a signal corresponding to the detected fundamental
frequency F0;
means for evaluating an error measure for each frequency
bin corresponding to harmonics of the fundamental
frequency in the spectrum of the signal; and
means for determining the voicing probability Pv of the
segment as the ratio of harmonics for which the evalu-
ated error measure is below a certain threshold and the
total number of harmonics in the spectrum of the signal.

16. The system of claim 12 further comprising:
means for forming one or more data packets correspond-
ing to each segment for subsequent transmission or
storage, the one or more data packets comprising: the
fundamental frequency F0, data representative of the
computed model of the signal, and the voicing prob-
ability Pv for the signal.

17. The system of claim 16 further comprising:
means for receiving the one or more data packets over
a communications medium; and
means for synthesizing audio signals from the received
one or more data packets.

18. The system of claim 17 wherein said means for
synthesizing audio signals comprises:
means for decoding the received one or more data packets
to extract: the fundamental frequency, the data repre-
sentative of the computed model of the signal and the
voicing probability Pv for the signal.

19. The system of claim 18 further comprising:
means for synthesizing an audio signal from the extracted
data, wherein the low frequency band of the spectrum
of said synthesized audio signal is synthesized using
data representative of the voiced portion of the signal;
the high frequency band of the spectrum of said syn-
thesized audio signal is synthesized using data repre-
sentative of the unvoiced portion of the signal and the
boundary between the low frequency band and the high
frequency band of the spectrum is determined on the
basis of the decoded voicing probability Pv.

20. The system of claim 19 wherein the audio signal being
synthesized is a speech signal and synthesizing further
comprises:
means for providing amplitude and phase continuity on
the boundary between adjacent synthesized speech
segments.

21. A method for synthesizing audio signals from one or
more data packets representing at least one time segment of
a signal, the method comprising:
decoding said one or more data packets to extract data
comprising: a fundamental frequency parameter,
parameters representative of a spectrum model of the
signal in said at least one time segment, and a voicing
probability Pv defined as a ratio between voiced and
unvoiced components of the signal in said at least one
segment;
generating a set of harmonics H corresponding to said
fundamental frequency, the amplitudes of said harmon-
ics being determined on the basis of the model of the
signal, and the number of harmonics being determined
on the basis of the decoded voicing probability Pv; and
synthesizing an audio signal using the generated set of
harmonics.

22. The method of claim 21 wherein the model of the
signal is an LPC model, the extracted data further comprises
a gain parameter, and the amplitudes of said harmonics are
determined using the gain parameter by sampling the LPC
spectrum model at harmonics of the fundamental frequency.

23. The method of claim 22 wherein the audio signal is
speech and generating a set of harmonics comprises apply-
ing a frequency domain filtering to shape the LPC spectrum
as to improve the perceptual quality of the synthesized
speech.

24. The method of claim 23 wherein the frequency
domain filtering is in accordance with the expression

$$P_f(\omega) = \left(\frac{R_\omega(\omega)}{R_{max}}\right)^{\beta}, \quad 0 \le \beta \le 1$$

where

$$R_\omega(\omega) = H(\omega)\,W(\omega)$$

in which $W(\omega)$ is the weighting function, represented as

$$W(\omega) = 1 + \sum_{k=1}^{p} a_k \gamma^k e^{-j\omega k}$$

the coefficient $\gamma$ is between 0 and 1, and the frequency
response $H(\omega)$ of the LPC filter is given by:

$$H(\omega) = \frac{1}{1 + \sum_{k=1}^{p} a_k e^{-j\omega k}}$$

where $a_k$ is the coefficient of a pth order all-pole LPC filter,
$\gamma$ is the weighting coefficient, and $R_{max}$ is the maximum
value of the weighted spectral envelope.

25. The method of claim 22 wherein said parameters
representative of a spectrum model are LSF coefficients
corresponding to the LPC spectrum model.

26. The method of claim 25 wherein synthesizing an
audio signal comprises linearly interpolating LSF coeffi-
cients across a current segment using LSF coefficients from
the previous segment as to increase the accuracy of the
signal synthesis.

27. The method of claim 26 wherein linear interpolating
LSF is applied at two or more subsegments of the signal.

28. A method for synthesizing audio signals from one or
more data packets representing at least one time segment of
a signal, the method comprising:
decoding said one or more data packets to extract data
comprising: a fundamental frequency parameter,
parameters representative of a spectrum model of the
signal in said at least one time segment, one or more
parameters representative of a residual excitation signal
associated with said spectrum model of the signal, and
a voicing probability Pv defined as a ratio between
voiced and unvoiced components of the signal in said
at least one time segment;
providing a filter, the frequency response of which cor-
responds to said spectrum model of the signal; and
synthesizing an audio signal by passing a residual exci-
tation signal through the provided filter, said residual
excitation signal being generated from said fundamen-
tal frequency, said one or more parameters representa-
tive of a residual excitation signal associated with said
spectrum model of the signal, and the voicing prob-
ability Pv.

29. The method of claim 28 wherein the provided filter is
a LPC filter, and said one or more parameters representative
of a residual excitation signal comprises a gain parameter.

30. The method of claim 28 wherein the audio signal is
speech and synthesizing an audio signal comprises applying
frequency domain filtering to shape the residual excitation
signal as to improve the perceptual quality of the synthe-
sized speech.

31. The method of claim 28 wherein said parameters
representative of a spectrum model are LSF coefficients
corresponding to a LPC spectrum model.

32. The method of claim 31 wherein synthesizing an
audio signal comprises linearly interpolating LSF coeffi-
cients across a current segment using LSF coefficients from
the previous segment as to increase the accuracy of the
signal synthesis.
